Abstract

In this paper we propose a method for estimating depth from a single image using a coarse-to-fine approach. We argue that modeling the fine depth details is easier after a coarse depth map has been computed. We express a global (coarse) depth map of an image as a linear combination of a depth basis learned from training examples. The depth basis captures spatial and statistical regularities and reduces the problem of global depth estimation to the task of predicting the input-specific coefficients in the linear combination. This is formulated as a regression problem from a holistic representation of the image. Crucially, the depth basis and the regression function are coupled and jointly optimized by our learning scheme. We demonstrate that this results in a significant improvement in accuracy compared to direct regression of depth pixel values or approaches learning the depth basis disjointly from the regression function. The global depth estimate is then used as guidance by a local refinement method that introduces depth details that were not captured at the global level. Experiments on the NYUv2 and KITTI datasets show that our method outperforms the existing state-of-the-art at a considerably lower computational cost for both training and testing.

1. Introduction 

Over the last few years depth estimation has been the subject of active research by the machine learning and computer vision community [?]. This can partly be attributed to the fact that algorithms using the depth channel as an additional cue have shown dramatic improvements over their RGB counterparts on a number of challenging vision problems [29, 7, 10, 11]. Most of these improvements have been demonstrated using depth measured by hardware sensors. However, most of the pictures available today are still traditional RGB (rather than RGBD) photos. Thus, there is a need for robust algorithms that estimate depth from single RGB images.

While inferring depth from a single view is ill-posed in general (an infinite number of 3D geometric interpretations can fit any given photo perfectly well), physical constraints and statistical regularities can be exploited to learn to predict depth from an input photo with good overall accuracy. In this work we propose to learn these spatial and statistical regularities from an RGBD training set in the form of a global depth basis. We hypothesize that the depth map of any image can be well approximated by a linear combination of this global depth basis. Following this reasoning we formulate coarse depth estimation as the problem of predicting the coefficients of the linear combination from the input image. Our design choice makes this regression problem easier, as the target dimensionality is much lower than the number of pixels and the output space is more structured. Crucially, we learn the depth basis and the regression model jointly by optimizing a single learning objective. We denote our global estimation method as GCL (Global Coupled Learning).

As input for our regression problem we use a holistic image representation capturing the coarse spatial layout of the scene. While in principle we could attempt to learn this holistic feature descriptor too, we argue that existing RGBD repositories are too limited in scope and size to learn features that would generalize well to different datasets. Instead, we propose to leverage a pretrained global feature representation that has been optimized for scene classification [32]. The intuition is that since these features have been tuned to capture spatial and appearance details that are useful to discriminate among a large number of scene categories, we expect them to also be effective generic features for the task of depth prediction. Our experiments on two distinct benchmarks validate this hypothesis, showing that our models trained on these scene features yield state-of-the-art results (without any fine-tuning).

Since our model is trained on a holistic description of the image, it can be argued that it is implicitly optimized to predict the main global 3D structure in the scene, possibly at the expense of fine depth details. To address this potential shortcoming we propose a local refinement step, which uses the global estimate to improve the depth prediction at individual pixels. We refer to this refinement procedure as RCL (Refined Coupled Learning). This is achieved by training a depth refinement function on hypercolumn features [8] of individual pixels, which describe the local appearance and context in the neighborhood of the pixel. Our experiments indicate that the local refinement quantitatively improves the global estimate and produces finer qualitative details. In Table 1 we show the global (GCL) and locally-refined (RCL) depth outputs produced by our system for a few example images.

Table 1. Visualization of depth estimates from NYUv2. Notice how the local refinement (RCL) captures finer depth details compared to GCL, such as the corner of the bed and the objects in the background of the first example, the bookcase in the second example, and the object on the bed and the corner of the room in the third example.

2. Related Work 

While initial approaches to depth estimation exploited specific cues like shading [?] and geometry [?], more recently the focus has shifted toward employing pure machine learning methods due to the heavily restrictive assumptions of these earlier methods. Most of the earlier machine learning based approaches [23, ?, 12] operate in a bottom-up fashion by performing local prediction (e.g., estimating depth for individual patches or superpixels) and by spatially smoothing these estimates with a CRF or an MRF model. The advantage of local prediction models is that they can be trained well even with limited RGBD data, since they treat each patch or pixel in the collection as a separate example. However, small regions do not capture enough context for robust depth estimation. In contrast, we approach depth estimation first at a global level by considering the entire image at once. Then we regress depth at a per-pixel level using the global estimate as a prior.

With the advent of larger RGBD repositories [25, 28] there has been an increased interest in the use of nonparametric methods [13, ?] for depth estimation. These approaches find nearest neighbors of the query in the training set, and then fuse the depth maps of the retrieved neighbors to produce a depth estimate for the query. Such approaches do not generalize well unless the test set is collected in the same exact environment as the training set. They also impose large computational and memory costs, since the training set must be stored and searched at test time.

Recently, there has been an increased interest in applying deep learning methods [4, 20, 26] for estimating depth from a single image. Most of these systems [4, 26] attempt to regress depth directly from the image. This requires learning large models that can be trained effectively only with hundreds of thousands of examples, which renders these techniques inapplicable in areas where data is scarce. Liu et al. [20] proposed learning deep features for superpixel depth prediction. A CRF is then applied to enforce global coherence over the entire depth map. While this approach does work for smaller datasets, it is restricted to coarse superpixel predictions in which each superpixel is assumed to be facing the camera (i.e., to have no depth gradient).

Our approach differs from prior work in two fundamental aspects. First, our approach predicts a small set of depth reconstruction weights rather than the full depth maps. Our design choice exploits statistical regularities in the problem and reduces the number of outputs to predict. We demonstrate that this allows our method to achieve a much lower RMSE than methods predicting depth maps directly [4, 26, 20], even when using 150 times less training data (on NYUv2). Second, our refinement model is trained to predict the depth at individual pixels using local pixel descriptors rather than superpixels [20]. Furthermore, we also show how to leverage features from deep networks trained on related tasks to further improve performance.

Our joint optimization of depth basis and regression is inspired by prior work in semi-coupled dictionary learning [27]. Here we borrow this optimization scheme to perform joint learning of a depth dictionary and a regressor from the image space to the basis weights in order to reduce the number of outputs to predict. While we focus on the problem of depth estimation from a single view, we believe that our approach can be used effectively in other scenarios involving dense pixel-level predictions under limited availability of training data.

3. Technical Approach 

In the following subsections we discuss how to jointly learn a global depth basis and a transformation from a given image space to the basis weights using training data. Then we discuss how to use the trained model to infer the coarse global depth of an image. Finally, we show how to further refine the coarse estimate with pixel-level predictions.

Let $\mathcal{D} = \{(X_1, D_1), \ldots, (X_N, D_N)\}$ be the training set used to learn our model, where $X_i \in \mathbb{R}^{R \times C \times 3}$ represents the $i$-th image (consisting of $R$ rows, $C$ columns and 3 color channels) and $D_i \in \mathbb{R}^{R \times C}$ is its associated ground-truth depth map.

3.1. Global Depth Estimation 

3.1.1 Learning the Global Depth Model 

To learn the global depth model, we start by downsampling the ground truth training depth maps. This has the effect of removing fine depth details (object boundaries, fine gradients denoting local shape, etc.). We denote with $d_i \in \mathbb{R}^{P_L}$ the vector obtained by vectorizing the depth map $D_i$ after resizing to a lower resolution, where $P_L$ represents the dimensionality of the low-resolution depth map. Similarly, we indicate with $x_i \in \mathbb{R}^{3RC}$ the vector obtained by stacking the pixel values of the image one on top of the other. Our objective is to train a model that, given an input image $x$ (at full resolution), predicts the global depth map $d$.

The first assumption we make is that the global depth map $d$ can be expressed as a linear combination of basis vectors from a depth basis $B = [b_1, \ldots, b_m]$:

$d = Bw$    (1)

where the $b_k \in \mathbb{R}^{P_L}$ are the basis atoms and $w = [w_1, \ldots, w_m]^\top$ is the vector containing the image-specific mixing coefficients (or weights). Fig. 1 shows both quantitatively and qualitatively the effect of varying the dictionary size on depth reconstruction. We propose to learn a mapping $h : \mathbb{R}^{3RC} \to \mathbb{R}^{m}$ that predicts the depth reconstruction weights $w$ from the input image $x$. Note that in our work $m \ll P_L$ (e.g., $m = 48$ for NYUv2 and $m = 96$ for KITTI) and thus the use of the depth basis greatly reduces the number of outputs that the regression model needs to predict. To regress on $w$ we choose a simple kernel-based regression model

$h(x) = T\phi(x) \approx w$    (2)

where $T \in \mathbb{R}^{m \times n}$ and $\phi(x) = [\phi_1(x), \ldots, \phi_n(x)]^\top$ is a vector containing radial basis functions $\phi_j(x)$ computed with respect to centers $c_j$ for $j = 1, \ldots, n$. The centers $c_1, \ldots, c_n$ are example images (different from those included in the training set) selected according to the details described in Section 3.3. Intuitively, they represent $n$ prototypical images that allow us to express $w$ as a linear combination of kernel distances from $x$. We compute the radial basis functions in terms of feature descriptors $f(x)$, $f(c)$ extracted from the images $x$, $c$. We use as image representation $f(x)$ the features computed by layer pool5 of the deep network of the PLACES model [32]. This is the max-pooled output of the fifth and final convolutional layer in the network. This feature map has dimensionality 6 × 6 × 256 = 9216. While prior work [6, ?] has shown that the subsequent (fully connected) layers of the Krizhevsky [?] network (same architecture, different dataset) produce higher-level representations that yield improved recognition accuracy, pool5 is the most appropriate feature map to use in our setting since it is the last layer preserving explicit location information before the “spatial scrambling” of the fully connected layers. Note that a spatially-variant representation is crucially necessary to predict the depth at each pixel. We validated this intuition experimentally and observed that using the feature maps from the fully connected layers produced poorer depth prediction accuracy. Using this representation for the feature vector $f(x)$, we then compute $\phi_j(x) = \exp(-\|f(x) - f(c_j)\|^2 / 2\sigma^2)$.
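The kernel regression of Eq. (2) can be illustrated with a minimal NumPy sketch. It assumes a Gaussian kernel with a squared Euclidean distance and a single bandwidth; the feature vectors, the bandwidth value and the function names are hypothetical placeholders rather than the authors' implementation.

```python
import numpy as np

def rbf_features(f_x, f_centers, sigma):
    """phi_j(x) = exp(-||f(x) - f(c_j)||^2 / (2 sigma^2)) for every center c_j."""
    sq_dist = np.sum((f_centers - f_x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def predict_weights(T, phi_x):
    """Eq. (2): h(x) = T phi(x), an estimate of the m mixing weights."""
    return T @ phi_x

rng = np.random.default_rng(0)
n, m, feat_dim = 6, 4, 9216           # toy n and m; 9216 = 6 * 6 * 256 (pool5 size)
f_centers = rng.normal(size=(n, feat_dim))   # stand-ins for f(c_1), ..., f(c_n)
f_x = f_centers[2].copy()             # query whose features coincide with center 2
sigma = 100.0                         # hypothetical bandwidth
phi_x = rbf_features(f_x, f_centers, sigma)
T = rng.normal(size=(m, n))           # random stand-in for the learned transformation
w_hat = predict_weights(T, phi_x)
```

The kernel response equals 1 for the center that coincides with the query and decays toward 0 as the feature distance grows, so `w_hat` is dominated by the columns of `T` associated with nearby prototypes.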

Figure 1. (a) Reconstruction error (RMSE) of ground truth depth on NYUv2 for different dictionary sizes (note that this experiment did not involve depth prediction from images, just ground truth depth approximation). For NYUv2 we use a dictionary of size 48 as this provides a good compromise in terms of compactness and approximation quality. (b) Qualitative effects of different dictionary sizes on a sample depth map. From left to right: ground truth (GT) and least-squares approximations using a dictionary of size $m = 12, 24, 48, 96$, respectively.

Given this model, a naive approach to training our depth estimator is to learn the depth basis and the regression mapping disjointly. This would involve first learning the basis $B$ and the weights $w$ of Eq. (1) (e.g., by minimizing the reconstruction error on training depths $d_1, \ldots, d_N$) and then regressing on these learned weights to estimate the transformation $T$ of Eq. (2). While straightforward, in our experiments we demonstrate that this two-step process yields much inferior results compared to a joint optimization over $B$, $w$, and $T$ using a single learning objective that couples all of the parameters together. We refer to this learning objective as $J(B, w, T)$ and define it as follows:

$$J(B, w, T) = \sum_{i=1}^{N} \|d_i - B w_i\|_2^2 + \lambda_w \sum_{i=1}^{N} \|w_i\|_1 + \lambda_r \sum_{i=1}^{N} \|w_i - T\phi(x_i)\|_2^2 + \lambda_T \|T\|_F^2 \qquad (3)$$

The first two terms of $J$ encourage reconstruction of the depth maps using sparse weights and are equivalent to the terms of the traditional sparse coding objective [17]. The third term imposes the requirement that the depth weights be “predictable” under the regression model. The final term is a regularizer over the transformation $T$. Thus, joint optimization of $J$ over all parameters will yield a depth basis $B$, depth weights $w$, and transformation $T$ that simultaneously minimize 1) the sparse reconstruction error of the depth maps and 2) the regression error from the image domain to the depth space, subject to appropriate regularizations. In practice we minimize $J(B, w, T)$ with the added constraints $\|b_k\|_2 \le 1$ for $k = 1, \ldots, m$ in order to avoid scale degeneracies on $B$. Furthermore, we enforce positivity constraints on the sparse weights $w_{ij}$ in order to define a purely additive depth model. We have found experimentally that this yields slightly better results than leaving the weights unconstrained. We also considered using an L2 penalty over the weights $w_i$ in place of the sparsity term, but found that this produces consistently slightly worse results, as also reported in prior articles [22].
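To make the four terms of the objective concrete, the following sketch evaluates $J$ on toy data. It assumes squared L2 norms for the reconstruction and regression terms, an L1 penalty on the weights and a squared Frobenius norm on $T$; the function name, argument layout and regularization values are hypothetical.

```python
import numpy as np

def coupled_objective(D, B, W, T, Phi, lam_w, lam_r, lam_T):
    """Evaluate J(B, w, T) under the assumptions stated in the lead-in.

    D:   P_L x N matrix of low-resolution depth vectors d_i (one per column)
    B:   P_L x m depth basis
    W:   m x N matrix of weights w_i
    T:   m x n transformation
    Phi: n x N matrix of RBF feature vectors phi(x_i)
    """
    reconstruction = np.sum((D - B @ W) ** 2)
    sparsity = lam_w * np.sum(np.abs(W))
    regression = lam_r * np.sum((W - T @ Phi) ** 2)
    transform_reg = lam_T * np.sum(T ** 2)
    return reconstruction + sparsity + regression + transform_reg

# Toy example: perfect reconstruction, a zero regressor, one training image.
B = np.eye(2)
W = np.array([[1.0], [2.0]])
D = B @ W
T = np.zeros((2, 3))
Phi = np.ones((3, 1))
J = coupled_objective(D, B, W, T, Phi, lam_w=0.1, lam_r=2.0, lam_T=5.0)
# reconstruction = 0, sparsity = 0.1 * 3, regression = 2 * (1 + 4), regularizer = 0
```

In this toy case the whole cost comes from the sparsity and regression terms, which shows how the coupling penalizes weights that reconstruct depth well but cannot be predicted from the image features.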

While our learning objective is not jointly convex over $w$, $B$, $T$, it is convex in each of these individual parameters when we keep the other two fixed. Based on this, we optimize our learning objective via block-coordinate descent by minimizing in turn with respect to 1) the depth weights, 2) the basis and 3) the transformation. These three alternating steps are discussed in detail below:

1. Estimate weights $w$ given parameters $B$, $T$. It is easy to verify that minimizing $J$ with respect to $w$ while keeping $B$ and $T$ fixed (at the current estimate) reduces to a problem of the form:

$$\arg\min_{w} \sum_{i=1}^{N} \|\tilde{d}_i - C w_i\|_2^2 + \lambda_w \sum_{i=1}^{N} \|w_i\|_1 \qquad (4)$$

where $\tilde{d}_i$, $C$ are constants written in terms of $B$ and $T$. We solve this problem globally via least angle regression (LARS) [?].

2. Learning the depth basis $B$ given $w$, $T$. This amounts to an L2-constrained least-squares problem, which we solve using the Lagrange dual, as in Lee et al. [17].

3. Learning the transformation $T$ given $B$, $w$. This reduces to an L2-regularized least-squares problem, which can be solved in closed form as shown in Wang et al. [27].
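One way to see why step 1 reduces to a standard L1-regularized least-squares problem is to stack the reconstruction and regression residuals of each training example. The identity below is a sketch that assumes squared L2 norms in the objective; the stacked vector and matrix are illustrative constructions and their exact form is not confirmed by the text.

```latex
\begin{aligned}
\|d_i - B w_i\|_2^2 + \lambda_r \|w_i - T\phi(x_i)\|_2^2
&= \left\|
\begin{bmatrix} d_i \\ \sqrt{\lambda_r}\, T\phi(x_i) \end{bmatrix}
-
\begin{bmatrix} B \\ \sqrt{\lambda_r}\, I_m \end{bmatrix} w_i
\right\|_2^2 .
\end{aligned}
```

With $B$ and $T$ fixed, the stacked vector plays the role of a constant target and the stacked matrix the role of a constant design matrix, so adding the L1 penalty and the positivity constraint gives an independent Lasso-type problem for each $w_i$, which is the setting LARS handles.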

We initialize this optimization by setting $B$ and $w$ to the solution computed via sparse coding [17], thus neglecting the terms in $J$ that depend on the transformation $T$. We then compute $T$ by solving step 3 above. Fig. 2 shows the bases learned with this procedure on NYUv2 and KITTI.
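Step 3 admits a closed-form solution because, with the weights fixed, only the regression and regularization terms depend on $T$. The sketch below assumes these terms are $\lambda_r \sum_i \|w_i - T\phi(x_i)\|_2^2 + \lambda_T \|T\|_F^2$ and solves the resulting ridge regression; the function name and toy sizes are hypothetical.

```python
import numpy as np

def update_transformation(W, Phi, lam_r, lam_T):
    """Closed-form minimizer of lam_r * ||W - T Phi||_F^2 + lam_T * ||T||_F^2.

    Setting the gradient to zero gives T (lam_r Phi Phi^T + lam_T I) = lam_r W Phi^T.
    """
    n = Phi.shape[0]
    A = lam_r * (Phi @ Phi.T) + lam_T * np.eye(n)   # symmetric positive definite
    rhs = lam_r * (Phi @ W.T)
    return np.linalg.solve(A, rhs).T                # solve A T^T = rhs, then transpose

rng = np.random.default_rng(0)
m, n, N = 4, 6, 20                                  # toy sizes
W = np.abs(rng.normal(size=(m, N)))                 # stand-in for non-negative weights
Phi = rng.uniform(size=(n, N))                      # stand-in for RBF features
lam_r, lam_T = 1.0, 0.1
T = update_transformation(W, Phi, lam_r, lam_T)

# The gradient of the two T-dependent terms should vanish at the solution.
grad = -2.0 * lam_r * (W - T @ Phi) @ Phi.T + 2.0 * lam_T * T
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse, which is usually preferable numerically when the number of centers is in the thousands.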

3.1.2 Global Depth Map Inference 

At inference time, given a new input image $x$, we compute its global depth map $d^g$ by finding the sparse depth weights $w$ that best fit the image-based prediction, i.e., by solving the following optimization problem subject to positivity constraints on the weights:

$$\arg\min_{w} \|w - T\phi(x)\|_2^2 + \lambda_w \|w\|_1 \qquad (5)$$

The global depth map is then generated as $d^g = Bw$.
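Assuming the data term of Eq. (5) is a squared L2 norm, the problem separates across the entries of $w$, and with the positivity constraint each entry has a simple closed form: a shifted, clipped copy of the regression output. The sketch below illustrates this; the function names and toy values are hypothetical.

```python
import numpy as np

def infer_weights(T, phi_x, lam_w):
    """Minimize ||w - T phi(x)||_2^2 + lam_w * ||w||_1 subject to w >= 0.

    For w >= 0 the L1 norm equals sum(w), so each entry minimizes
    (w_k - a_k)^2 + lam_w * w_k, giving w_k = max(a_k - lam_w / 2, 0).
    """
    a = T @ phi_x
    return np.maximum(a - lam_w / 2.0, 0.0)

def infer_global_depth(B, T, phi_x, lam_w):
    """Global depth map as the linear combination B w of the basis atoms."""
    return B @ infer_weights(T, phi_x, lam_w)

T = np.eye(3)                                # toy transformation
phi_x = np.array([1.0, 0.1, -0.5])           # toy feature vector
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])              # toy basis with P_L = 2 and m = 3
lam_w = 0.4
w_hat = infer_weights(T, phi_x, lam_w)
d_global = infer_global_depth(B, T, phi_x, lam_w)
```

Small or negative regression outputs are set exactly to zero, which is what keeps the inferred weight vector sparse and the depth model purely additive.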

Empirically, we have found it beneficial to apply the colorization procedure described in Levin et al. [18] to the global estimate $d^g$ produced by our approach. This technique has been used in previous work [25, 4] to fill in missing values in data collected by depth sensors. Here instead we use it to make the depth map more spatially coherent, as the colorization procedure encourages pixels having similar color to be mapped to similar values of depth. To do this, we first resize the low-resolution depth map $d^g$ via bilinear interpolation to an intermediate size of $P_I$ pixels. Then we apply the colorization procedure using all pixels at this resolution as “color” propagation seeds with a low penalty value (the penalty value indicates how much the colorized depth values can deviate from the original input value). The details of this procedure are discussed in Section 3.3.

Figure 2. The depth basis learned by our global model GCL on the NYUv2 dataset (left) and on KITTI (right). The basis is used to model the structure of the output space (depth), thus reducing the complexity of the regression problem.


3.2. Local Depth Refinement 

The local depth refinement uses the prediction $d^g$ from our global depth model (described in the previous section) and generates a higher-resolution, locally-refined depth map $d^r$ containing finer details. Let $\tilde{d}^g_i$ be the global depth estimate for image $i$, resized to the intermediate resolution $P_I$ and post-processed via colorization as described in Section 3.1.2. Also, we denote with $d^I_i$ the ground truth depth map $D_i$ resized to size $P_I$ and vectorized. We propose to train a local refinement model that predicts the depth of pixel $j$ in example $i$ using a local descriptor $\varphi_j(x_i)$ computed at pixel $j$, i.e., $d^I_{ij} \approx F \cdot \varphi_j(x_i)$, where $F$ is a row vector encoding the model parameters. Note that this parameter vector is shared across pixels but, unlike our global depth estimator, the refinement predicts the depth at a pixel using as input a local descriptor computed at that pixel rather than the whole image. Specifically, we choose

$$\varphi_j(x_i) = \left[\, 1,\ \tilde{d}^g_{ij},\ \varphi_{j,1}(x_i),\ \ldots,\ \varphi_{j,K}(x_i) \,\right]^\top \qquad (6)$$

where $\tilde{d}^g_{ij}$ is the depth estimate for pixel $j$ in image $i$ from our global model, which is used as an additional feature here in order to guide the local refinement. Thus, the global depth estimate acts in a sense as a prior for the local refinement, which lacks the context of the full image. In our experiments we show that providing $\tilde{d}^g_{ij}$ as a feature is critically necessary to achieve good accuracy in the local refinement. The first feature entry is set constant to 1 in order to implement the bias term. Finally, the features $\varphi_{j,1}(x_i), \ldots, \varphi_{j,K}(x_i)$ are radial basis functions computed with respect to $K$ centers. Note that while the radial basis functions for the global model were defined in terms of deep features $f(x)$ computed from the whole image, global features are clearly not appropriate for the local refinement. Instead, we propose to use the hypercolumn feature vector [8] at pixel $j$, i.e., the activation values at location $j$ in the convolutional feature maps of the deep PLACES network [32], all stacked into a single vector. In practice, we use only layers pool2, conv4 and conv5, which give rise to a hypercolumn vector of dimensionality 896 for each pixel. This representation has been shown to be able to simultaneously capture localized low-level visual information (from the early layers) as well as high-level semantics (from the deepest layers). Thus, it is very useful for localized, high-level visual analysis, such as our task of local depth refinement. More formally, we compute the radial basis features as $\varphi_{j,k}(x_i) = \exp(-\|a_j(x_i) - \hat{c}_k\|^2 / 2\nu^2)$, where $a_j(\cdot)$ denotes the function that extracts the hypercolumn representation at pixel $j$ and $\hat{c}_k$ is the $k$-th center, itself a hypercolumn feature vector. As discussed in further detail in Section 3.3, the centers are the centroids computed by k-means over a training set of hypercolumn feature vectors.
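The construction of a hypercolumn can be sketched as follows: each convolutional feature map is resampled to a common spatial grid and the channels are concatenated per pixel. This sketch uses nearest-neighbor upsampling as a simplification, and the 13 × 13 spatial size and the per-layer channel counts (256, 384, 256) are assumptions chosen so that they sum to the 896 dimensions stated above; the function names are hypothetical.

```python
import numpy as np

def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbor resampling of an (h, w, c) feature map to (out_h, out_w, c)."""
    h, w, _ = fmap.shape
    rows = (np.arange(out_h) * h) // out_h
    cols = (np.arange(out_w) * w) // out_w
    return fmap[rows][:, cols]

def hypercolumns(feature_maps, out_h, out_w):
    """Stack the resampled feature maps along the channel axis, one vector per pixel."""
    resized = [upsample_nearest(f, out_h, out_w) for f in feature_maps]
    return np.concatenate(resized, axis=-1)

rng = np.random.default_rng(0)
pool2 = rng.normal(size=(13, 13, 256))   # assumed layer shapes, random stand-in values
conv4 = rng.normal(size=(13, 13, 384))
conv5 = rng.normal(size=(13, 13, 256))
# A small 16 x 22 grid keeps the example light; the text uses a larger intermediate size.
H = hypercolumns([pool2, conv4, conv5], out_h=16, out_w=22)
```

Each row of `H.reshape(-1, 896)` is then a candidate input $a_j(x)$ for the radial basis features, or a sample for the k-means clustering that produces the centers.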

The parameter vector $F$ is learned via simple regularized least-squares estimation on the training data:

$$\arg\min_{F} \sum_{i=1}^{N} \sum_{j=1}^{P_I} \left( d^I_{ij} - F \cdot \varphi_j(x_i) \right)^2 + \lambda_F \sum_{k=3}^{K+2} F_k^2 \qquad (7)$$

where the first two entries of $F$ (corresponding to the bias and the global depth prediction) are left unregularized.
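A regularized least-squares fit that leaves the first two parameters unpenalized can be written with a diagonal penalty matrix whose first two entries are zero. The sketch below assumes a squared-error loss and a squared L2 penalty on the remaining entries; the function name, the regularization value and the synthetic data are hypothetical.

```python
import numpy as np

def fit_refinement(Psi, target, lam):
    """Ridge regression where the bias and global-depth entries are not penalized.

    Psi:    (num_pixels, K + 2) matrix with rows [1, global_depth, rbf_1, ..., rbf_K]
    target: (num_pixels,) ground-truth depths at the same pixels
    """
    penalty = lam * np.eye(Psi.shape[1])
    penalty[0, 0] = 0.0      # bias term left unregularized
    penalty[1, 1] = 0.0      # global depth feature left unregularized
    return np.linalg.solve(Psi.T @ Psi + penalty, Psi.T @ target)

rng = np.random.default_rng(0)
num_pixels, K = 200, 3                               # toy sizes
global_depth = rng.uniform(1.0, 5.0, size=num_pixels)
rbf = rng.uniform(size=(num_pixels, K))              # stand-in for RBF responses
Psi = np.column_stack([np.ones(num_pixels), global_depth, rbf])
target = 0.5 + 2.0 * global_depth                    # depth explained by bias + global estimate
F = fit_refinement(Psi, target, lam=10.0)
```

Because the bias and the global-depth coefficient are unpenalized, the fit recovers them exactly in this synthetic case and sets the RBF coefficients to zero; with real data the RBF terms would absorb the residual that the global estimate cannot explain.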

At test time, given the input image $x$ and its global depth estimate $\tilde{d}^g$, we obtain the locally-refined depth value $d^r_j$ at pixel $j$ as $d^r_j = F \cdot \varphi_j(x)$. Finally, we take this depth estimate at the intermediate resolution ($P_I$ pixels), resize it to the full resolution ($R \times C$) using bilinear interpolation and apply the colorization scheme once more, in order to render the final output more spatially coherent.

It is important to note that besides the use of local information (rather than the context from the full image), another fundamental difference between our global depth estimation and the refinement lies in the fact that the latter directly regresses on depth, while the former predicts depth reconstruction weights (i.e., the vector $w$). This is consistent with the distinct objectives of the two steps: the global estimate takes advantage of the basis constraint to yield a robust but coarse estimate of the depth map; the local refinement can leverage the global depth estimate as a strong feature and thus can model the depth at individual pixels in an unconstrained fashion.

3.3. Implementation details 

In this section we provide additional implementation details concerning our approach. To learn the global depth model, we downsample the training depth maps from size 427 × 561 to size 32 × 43 for NYUv2 and from 256 × 1242 to 32 × 156 for KITTI (in order to maintain the aspect ratio of the ground truth) via bilinear interpolation. The sizes were chosen to reduce the dimensionality sufficiently to allow training of the basis without overfitting, while at the same time producing a coarse depth map that contains meaningful information. Furthermore, we subtract the per-pixel mean from each of the depth maps so as to force our model to predict the deviations from the mean depth map. At inference time we add the mean depth to our depth estimate to get the final prediction. As intermediate resolution $P_I$, we use 128 × 172 for NYUv2 and 64 × 311 for KITTI.

In the work by Krizhevsky et al. [?] the deep network was applied to multiple crops of the image and the predictions on the individual crops were then averaged. Inspired by this approach, we defined five distinct image crops (Center (C), Upper Left (UL), Upper Right (UR), Down Left (DL), Down Right (DR)) of size 227 × 227 and we learned a distinct global model for each of the crops. However, note that all 5 models are trained to predict the complete depth map (thereby also estimating depth at pixels not in the crop). At inference, we generate the final depth at each pixel as a weighted average of the predictions from the 5 crops. We use a spatially-varying weighting function of the 5 estimates at the coarse size. The weight of crop $i$ at pixel location $p$ is computed as $A_i(p) = \exp(-\|p - p_i\| / \gamma^2) \,/\, \sum_{j=1}^{5} \exp(-\|p - p_j\| / \gamma^2)$, where $p_i$ is the center pixel of crop $i$. Thus, at each pixel we give more importance to the predictions of crops that are closer to the pixel.
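The spatially-varying fusion of the five crop predictions can be sketched as a per-pixel softmax over negative distances to the crop centers. The crop center coordinates and the value of the bandwidth below are hypothetical placeholders, as is the assumption that the distance enters unsquared and is divided by the squared bandwidth.

```python
import numpy as np

def crop_weight_maps(height, width, crop_centers, gamma):
    """Normalized weights A_i(p) for every crop i and pixel p, shape (num_crops, H, W)."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pixels = np.stack([rows, cols], axis=-1).astype(float)            # (H, W, 2)
    diff = pixels[None] - crop_centers[:, None, None, :]              # (crops, H, W, 2)
    dist = np.linalg.norm(diff, axis=-1)
    scores = np.exp(-dist / gamma ** 2)
    return scores / scores.sum(axis=0, keepdims=True)

def fuse_crop_predictions(preds, weights):
    """Weighted average of per-crop depth maps, each of shape (H, W)."""
    return np.sum(weights * preds, axis=0)

height, width = 32, 43                        # coarse NYUv2 size from the text
crop_centers = np.array([[16, 21],            # C   (hypothetical coordinates)
                         [8, 10],             # UL
                         [8, 32],             # UR
                         [24, 10],            # DL
                         [24, 32]], dtype=float)  # DR
weights = crop_weight_maps(height, width, crop_centers, gamma=3.0)
preds = np.stack([np.full((height, width), v) for v in (1.0, 2.0, 3.0, 4.0, 5.0)])
fused = fuse_crop_predictions(preds, weights)
```

Since the weights are non-negative and sum to one at every pixel, the fused depth is always a convex combination of the five per-crop predictions.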

For each crop, we form the centers $c_j$ used in the radial basis functions $\phi_j(x)$ by taking image examples from the two nearest crops. We use (UL, UR) as centers for C, (C, UR) as centers for UL, (C, UL) as centers for UR, (C, DR) as centers for DL and (C, DL) as centers for DR. We double the number of centers by also including the mirrored version of each crop in the kernel vector. For NYUv2, as the number of training examples is small (795), we use all training images as centers. Thus the RBF kernel vector of each crop contains a total of 795 × 2 × 2 = 3180 centers (mirrored and un-mirrored versions of each of the 2 closest crops for all 795 images). For KITTI, since the training set is in this case much larger (19,852 images), we use only a subset of it to create the RBF vector: specifically, for each crop we form the centers with the 654 examples that were used to train the framework of Saxena et al. [?], once again by choosing the mirrored and un-mirrored versions of the 2 closest crops for all these images (this yields a total of 654 × 2 × 2 = 2616 centers for each crop). The bandwidth $\sigma$ in the kernel is set to half of the maximum pairwise distance between centers.
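The bandwidth rule and the center counts can be checked with a short sketch. It assumes the pairwise distance is the Euclidean distance between the feature vectors of the centers; the function name and the toy centers are hypothetical.

```python
import numpy as np

def half_max_pairwise_distance(features):
    """Half of the largest Euclidean distance between any two rows of `features`."""
    sq_norms = np.sum(features ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (features @ features.T)
    sq_dists = np.maximum(sq_dists, 0.0)      # guard against small negative round-off
    return 0.5 * np.sqrt(np.max(sq_dists))

centers = np.array([[0.0, 0.0],
                    [3.0, 4.0],
                    [0.0, 1.0]])              # toy 2-D "feature vectors"
sigma = half_max_pairwise_distance(centers)   # farthest pair is 5 apart

# Center counts: images x 2 nearest crops x (original + mirrored).
n_centers_nyu = 795 * 2 * 2
n_centers_kitti = 654 * 2 * 2
```

Computing the squared distances through the Gram matrix avoids an explicit double loop, which matters when the number of centers is in the thousands and each feature vector has 9216 dimensions.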

For refinement, instead of learning a single shared model for all pixels of the image, we trained a separate pixel-based model for each block of 16 rows of the image (for a total of 8 distinct models). This is motivated by the observation that pixels within a row (or in neighboring rows) of the image tend to have similar depth statistics, but pixels coming from distant rows often exhibit large depth variations, as already noted in Saxena et al. [?]. This is merely a consequence of ceilings being typically at the top of the image, walls in the middle and floors at the bottom of the picture. Each model is trained with a 512-dimensional RBF kernel vector augmented with the global depth estimate $\tilde{d}^g$ and the constant feature 1. The 512 RBF centers for each block of rows are the k-means cluster centroids obtained by clustering randomly sampled pixels from that block of rows in the training set. Since using the global depth estimates on the training set would overfit the data and generate biased estimates of the feature $\tilde{d}^g$, we performed a 10-fold cross validation on the training set and used the global depth estimates predicted on each validation fold to generate the features for the subsequent training of the refinement. For each fold, we apply the procedure of training 5 different crops and merging outputs. For colorization, we set the penalty value to 0.001 for both GCL and RCL.
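The cross-validation scheme used to obtain unbiased global estimates on the training set can be sketched generically: every training example receives a prediction from a model that never saw it. The `fit` and `predict` callables below are hypothetical stand-ins for training and applying the global model, illustrated here with ordinary least squares.

```python
import numpy as np

def out_of_fold_predictions(X, Y, fit, predict, n_folds=10, seed=0):
    """Predict each example with a model trained on the other folds only."""
    n = X.shape[0]
    order = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(order, n_folds)
    out = np.full(Y.shape, np.nan)
    for held_out in folds:
        train = np.setdiff1d(order, held_out)      # indices of the remaining folds
        model = fit(X[train], Y[train])
        out[held_out] = predict(model, X[held_out])
    return out

# Toy stand-in for the global model: a linear map fitted by least squares.
def fit_linear(X, Y):
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def predict_linear(model, X):
    return X @ model

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
A = rng.normal(size=(3, 2))
Y = X @ A                                          # noiseless targets
Y_oof = out_of_fold_predictions(X, Y, fit_linear, predict_linear)
```

The resulting out-of-fold estimates have error statistics closer to those seen at test time, which is what makes them suitable as the global-depth feature when training the refinement.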


4. Experiments 


In this work we apply our proposed approach to the NYUv2 [25] and KITTI [?] datasets and show that it produces state-of-the-art results on depth estimation for both. These two datasets are dramatically different and serve well the objective of showing that our approach works for both indoor and outdoor settings.

Depth estimation can be quantitatively assessed according to different criteria. In this work, we report results on multiple metrics that are widely used: RMSE [?], Absolute Relative error [?], Scale-Invariant error [?], Threshold error [?], and Log10 error [?]. For evaluation, the outputs of both GCL and RCL are upsampled to full resolution. This allows us to compare the global and the refined estimates on the same ground. Both RMSE and Log10 are measured in meters.
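For reference, the metrics listed above can be sketched as follows. These are common formulations from the depth estimation literature and are an assumption here: the exact variants used in the evaluation (in particular for the scale-invariant error) are not spelled out in this section, and the function name is hypothetical.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Common single-image depth metrics over valid (strictly positive) depths."""
    pred = np.asarray(pred, dtype=float).ravel()
    gt = np.asarray(gt, dtype=float).ravel()
    log_diff = np.log(pred) - np.log(gt)
    return {
        "rmse": np.sqrt(np.mean((pred - gt) ** 2)),
        "rel": np.mean(np.abs(pred - gt) / gt),
        "log10": np.mean(np.abs(np.log10(pred) - np.log10(gt))),
        # Fraction of pixels with max(pred/gt, gt/pred) below the threshold 1.25.
        "threshold": np.mean(np.maximum(pred / gt, gt / pred) < 1.25),
        # Variance of the log-depth error: unchanged when pred is rescaled.
        "sc_inv": np.mean(log_diff ** 2) - np.mean(log_diff) ** 2,
    }

gt = np.array([1.0, 2.0, 3.0, 4.0])
perfect = depth_metrics(gt, gt)
doubled = depth_metrics(2.0 * gt, gt)      # every depth overestimated by a factor of 2
```

Doubling every prediction leaves the scale-invariant term at zero while the relative error becomes 1 and the threshold accuracy drops to 0, which illustrates why the metrics can rank methods differently.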

The rest of this section is organized as follows: in §4.1 we present results of our models on NYUv2 and compare them to the state-of-the-art; in §4.2 we discuss our experiments on KITTI; finally, in §4.3 we describe experimental results obtained by varying our model design choices, thus providing further empirical justification for our approach and the settings used in §4.1 and §4.2.


4.1. NYUv2 

The NYUv2 dataset [25] consists of RGBD examples from 27 different indoor scene categories taken from a total of 464 different scenes. We evaluate our methods using the standard train/test split provided by the authors of NYUv2 (795 training examples, 654 test examples) [25]. For the global method we use a depth basis $B$ consisting of $m = 48$ atoms. This provided a good compromise between the ability to approximate global depth and the predictability of the basis weights.

                   Metric         GCL      RCL      [?]      [?]      [?]     [?]     [?]     [?]     Mean Prediction
Higher is better   Th. δ < 1.25   0.6083   0.6096   0.5179   0.5422   NR      0.447   NR      0.614   0.4284
Lower is better    Rel            0.2523   0.2415   0.2544   NR       0.374   0.349   0.335   0.230   0.4017
Lower is better    Log10          0.0973   0.0960   0.1179   NR       0.134   NR      0.127   0.095   0.1444
Lower is better    Sc-Inv         0.2382   0.2363   0.2719   NR       NR      0.325   NR      NR      0.3052
Lower is better    RMSE           0.8156   0.8025   0.9917   NR       1.12    1.214   1.060   0.824   1.2049

Table 2. Quantitative evaluation on NYUv2. Our models (GCL and RCL) outperform prior work by a large margin according to the RMSE metric. Our refinement (RCL) provides a small but consistent improvement over our global estimate (GCL). NR stands for not reported. Results on Make3D were taken from the evaluation of Eigen et al. [4]. Results for Karsch et al. [13] were taken from the evaluation with the correct train/test split done by Liu et al. [?]. The method in Ladicky et al. [?] was trained with a different train/test split (725 training, 724 testing).

We compare our approach on this benchmark with published state-of-the-art methods. The results are summarized in Table 2. The last column reports the performance obtained by simply predicting the constant average depth map (computed from the training set) for any input, as this is an interesting baseline revealing the difficulty of the dataset. As can be seen, both our models outperform all prior methods by a large margin on the RMSE (the metric we optimize for) and are highly competitive with other approaches according to the other performance measures.

Note that we did not include in Table 2 the results of the methods recently proposed by Eigen et al. [4] and Wang et al. [26], as these approaches were not trained on the standard training split of NYUv2. Both of these approaches use a training set that is 150 times larger than the one we employ in this work (only 795 images). The results for training with the expanded training set can be seen in Table 3.




                   Metric         Wang et al. [26]¹   Eigen et al. [4]   RCL-E
Higher is better   Th. δ < 1.25   0.6170              0.611              0.6294
Lower is better    Rel            0.2289              0.215              0.2250
Lower is better    Sc-Inv         0.228               0.219              0.2290
Lower is better    RMSE           0.8371              0.907              0.7846

Table 3. Side-by-side comparison between the deep learning based methods [4, 26] and our approach on the test set of NYUv2. RCL-E refers to our model learned on the expanded training set (the same used by the other approaches [4, 26]).


4.2. KITTI 

The KITTI dataset is an outdoor scene dataset consisting of videos taken from a driving vehicle with depth provided by a LiDAR sensor. On this dataset we used the train/test split proposed by Eigen et al. [4], consisting of 19,852 training examples and 697 test examples. The training and test sets include examples from the “city”, “residential” and “road” sequences. For evaluation on this dataset, we use the same experimental setup adopted by Eigen et al. [4].

¹The results for Wang et al. [26] are different from what they report in their paper, as they employ a non-standard evaluation method. We used depth estimates provided by the authors and ran our evaluation method to produce the results reported here and in the supplementary material.

We train our global depth model using a basis B consisting
of m = 96 atoms. Once again, we compare our estimates
against the ground truth by resizing our estimates to full
resolution.

Table 4 shows the results of our global model versus
Eigen et al. [4] (because KITTI is a recent dataset, we could
not find any other prior work using this training/test split
to include in the comparison). The table shows that, given
the same training data, our approach achieves higher accuracy
according to the RMSE and the Threshold metric, while it
is close to the approach of Eigen et al. [4] on the Relative
and Scale-Invariant metrics.
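
For readers unfamiliar with the Threshold, Relative and Scale-Invariant measures, the sketch below gives the definitions commonly used in the single-image depth literature. It is an assumption that these match the exact definitions used for the tables here (in particular, some works report the square root of the scale-invariant term), and the function names are illustrative. All inputs are assumed to contain strictly positive depths at valid pixels only.

```python
import numpy as np

def threshold_accuracy(pred, gt, thr=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below thr (higher is better)."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < thr))

def abs_relative_error(pred, gt):
    """Mean of |pred - gt| / gt (lower is better)."""
    return float(np.mean(np.abs(pred - gt) / gt))

def scale_invariant_error(pred, gt):
    """Variance of the log-depth difference; unchanged if pred is rescaled."""
    d = np.log(pred) - np.log(gt)
    return float(np.mean(d ** 2) - np.mean(d) ** 2)

gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.0, 2.0, 4.0, 12.0])   # only the last pixel is wrong

print(threshold_accuracy(pred, gt))       # 3 of 4 pixels within the threshold
print(abs_relative_error(pred, gt))       # (12 - 8) / 8 averaged over 4 pixels
print(scale_invariant_error(2.0 * gt, gt))  # a global scale error costs nothing
```

The last call shows why a method can trail on RMSE yet look competitive on the Scale-Invariant measure: the latter forgives a global scaling of the whole depth map.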



                                  Coarse             Refinement       Mean
Metric                         GCL      [4]        RCL      [4]      Predict.
Higher Better   Th δ < 1.25   0.691    0.679      0.699    0.692     0.556
Lower Better    Rel           0.218    0.194      0.206    0.190     0.412
                Sc-Inv        0.262    0.248      0.260    0.246     0.359
                RMSE          6.608    7.216      6.437    7.156     9.635

Table 4. Quantitative evaluation on the KITTI dataset. Evaluation
was conducted on the test set proposed by Eigen et al. [4].


4.3. Revisiting Model Design Choices 

4.3.1 Global Estimation 

In this section we study the impact of various design choices
made in our global approach. Table 5 summarizes this
comparative study of different variants of our global model
(GCL) on both NYUv2 and KITTI.

In this work we assumed that, in order to capture the
structure in the output space (depth spatial smoothness,
rejection of unlikely depth maps), it is beneficial to learn
to predict reconstructive depth basis weights rather than
regressing on depth directly. The second column of Table 5
(Direct Regr) shows the performance obtained by learning
a mapping that uses our kernel-based image features φ(x)
to directly regress on the depth d. As can be seen,
eliminating the basis model and regressing depth directly
causes an increase in RMSE on both datasets, thereby
validating the need for a depth basis.

Another assumption in our approach is that coupling the
learning of the basis and the regression is beneficial, as
it allows the method to optimize the depth representation
for accurate prediction. Our hypothesis is confirmed by the
results shown in the third column of Table 5 (Uncoupled),
which reports the performance obtained by learning a
dictionary via sparse coding and then regressing on the
weights of the dictionary. There is a clear degradation in
accuracy on both datasets when the modeling of depth and
the regression optimization are uncoupled.
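
To make the two ablated pipelines concrete, the following sketch contrasts them on synthetic data. It rests on several assumptions that are not part of our method: ridge regression stands in for the regressor, a truncated SVD stands in for sparse coding when learning the uncoupled basis, and random vectors stand in for the features φ(x). The toy depth maps are exactly low-rank, so both variants fit them; the sketch only illustrates the structure of the two pipelines and does not reproduce the accuracy gap of Table 5.

```python
import numpy as np

def ridge_fit(X, Y, lam=1e-3):
    """Closed-form ridge regression: W = (X^T X + lam*I)^-1 X^T Y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

rng = np.random.default_rng(0)
n, feat_dim, n_pixels, m = 200, 16, 64, 4

# Synthetic, exactly rank-m data: flattened depth maps D driven by features X.
X = rng.normal(size=(n, feat_dim))          # stand-in for the features phi(x)
D = X @ rng.normal(size=(feat_dim, m)) @ rng.normal(size=(m, n_pixels))

# "Direct Regr": map features straight to every depth pixel.
rmse_direct = rmse(X @ ridge_fit(X, D), D)

# "Uncoupled": learn a basis from depth alone, then regress its weights.
_, _, Vt = np.linalg.svd(D, full_matrices=False)
B = Vt[:m]                                  # (m, n_pixels), orthonormal rows
weights = D @ B.T                           # per-image basis coefficients
rmse_basis = rmse(X @ ridge_fit(X, weights) @ B, D)

print(rmse_direct, rmse_basis)
```

Note that the basis variant regresses only m values per image instead of one value per pixel. The coupled model differs from the "Uncoupled" pipeline shown here in that the basis and the regressor are optimized jointly rather than in two separate stages.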

Finally, we assess which deep features are effective at
predicting depth. We consider two types of features, both
extracted from the same deep network architecture but
trained on two different datasets: GCL uses “pool5”
features trained on PLACES [32], while GCL-I (last column
of Table 5) uses “pool5” features optimized for object class
recognition on ImageNet. Our results show that features
learned for scene classification perform much better on
depth estimation of scenes than features trained for object
classification.

4.3.2 Local Depth Refinement 

Here we present experiments that shed light on the role of 
different components of our local depth refinement (RCL). 

First, we assess the advantage of training separate models
for different row-blocks of the image. As discussed, for
RCL we subdivided the image into 8 non-overlapping blocks
of 16 rows and trained a distinct model for each block. We
now look at the impact of using a single model as opposed
to the multi-model setting. In order to construct an
equally powerful single model, we construct an 8 × 512-sized
RBF descriptor to train the single-model regressor. However,
we found that this yields consistently inferior results
compared to the multi-model setting, e.g., the RMSE on
NYUv2 is 0.8213 versus the 0.8025 of RCL.
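
The multi-model setting amounts to dispatching each pixel to the regressor of its row-block. The sketch below illustrates this under stated assumptions: a 128-row working resolution (implied by 8 blocks of 16 rows), plain ridge regression in place of the RBF-based regressor, and random per-pixel descriptors. All names are hypothetical.

```python
import numpy as np

ROWS_PER_BLOCK = 16
NUM_BLOCKS = 8   # 8 x 16 = 128 rows, the resolution implied by the text

def block_of_row(row):
    """Index of the row-block (and hence the model) responsible for an image row."""
    if not 0 <= row < ROWS_PER_BLOCK * NUM_BLOCKS:
        raise ValueError("row outside the image")
    return row // ROWS_PER_BLOCK

def fit_block_models(features, depth, lam=1e-6):
    """Fit one ridge regressor per row-block.
    features: (H, W, F) per-pixel descriptors; depth: (H, W) target depths."""
    f_dim = features.shape[2]
    models = []
    for b in range(NUM_BLOCKS):
        rows = slice(b * ROWS_PER_BLOCK, (b + 1) * ROWS_PER_BLOCK)
        X = features[rows].reshape(-1, f_dim)
        y = depth[rows].reshape(-1)
        models.append(np.linalg.solve(X.T @ X + lam * np.eye(f_dim), X.T @ y))
    return models

def predict(features, models):
    """Predict each row with the model of the block that row belongs to."""
    out = np.empty(features.shape[:2])
    for r in range(features.shape[0]):
        out[r] = features[r] @ models[block_of_row(r)]
    return out

# Toy data in which every block follows its own linear rule.
rng = np.random.default_rng(0)
features = rng.normal(size=(128, 4, 3))
true_w = rng.normal(size=(NUM_BLOCKS, 3))
depth = np.stack([features[r] @ true_w[r // ROWS_PER_BLOCK] for r in range(128)])

models = fit_block_models(features, depth)
pred = predict(features, models)
```

Since the blocks do not overlap, the eight models share no training pixels and can be fitted independently of each other.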

In order to show the importance of estimating the global
depth before the local refinement, we tried training a
variant of RCL that does not include the global depth
estimate in the feature vector. Effectively, this model uses
only the local hypercolumn vector to directly regress the
depth of each pixel. This results in dramatically worse
accuracy: the RMSE on NYUv2 is 1.1211 instead of 0.8025.
This further validates our belief that a good method for
local depth estimation requires a strong global model used
as guidance for further refinement.

            GCL      Direct Regr   Uncoupled    GCL-I
NYUv2     0.8156       0.8384       0.8843      0.8908
KITTI     6.6078       6.7414       6.7138      6.9923

Table 5. RMSE for different variants of our global estimation
method on NYUv2 and KITTI. GCL is our framework from Section
3.1. “Direct Regr” uses the image features to directly regress
on depth (no basis learning). “Uncoupled” learns the basis via
sparse coding and then trains a regression model on the learned
weights using our features φ(x). GCL-I corresponds to the use of
ImageNet (rather than PLACES) image features.


4.4. Analysis of Computational Cost

We now show that our approach is both scalable and
extremely fast to train. We compare the computational cost
of our approach to that of other competing methods
[4, 26, 20]. The deep learning approach described in Liu et
al. [20] requires 33 hours for training with a GPU using the
standard training set (795 examples). In contrast, our
global framework (GCL) requires approximately 15 minutes
for feature extraction on the standard train/test split of
the NYUv2 dataset and 10 minutes for learning all 5 models
on a Xeon E5 CPU. The systems described in Eigen et al. [4]
and Wang et al. [26] use 136,847 and 200,000 training
examples. The training of the coarse model in Eigen et al.
[4] takes 38 hours, while the model in Wang et al. [26]
takes 4 days to train using GPUs. The training of the GCL
model on the expanded training set (136,847 examples) takes
only 8 hours. GCL inference takes place in under a second.
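
The training times quoted above translate into the following wall-clock ratios. This is plain arithmetic on the reported figures; the variable names are illustrative, and the ratios compare GPU training of the other methods with CPU training of GCL, so they are indicative rather than a controlled benchmark.

```python
# Standard NYUv2 split (795 training images).
liu_hours = 33.0                       # Liu et al., GPU
gcl_hours = (15 + 10) / 60.0           # 15 min features + 10 min learning, CPU
speedup_standard = liu_hours / gcl_hours

# Expanded training sets.
eigen_coarse_hours = 38.0              # Eigen et al., coarse model
wang_hours = 4 * 24.0                  # Wang et al., 4 days on GPUs
gcl_expanded_hours = 8.0               # GCL on 136,847 examples
speedup_vs_eigen = eigen_coarse_hours / gcl_expanded_hours
speedup_vs_wang = wang_hours / gcl_expanded_hours

print(round(speedup_standard, 1), speedup_vs_eigen, speedup_vs_wang)
```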

For refinement, the model by Eigen et al. [4] takes 26
hours for training. In comparison, our refinement method
(RCL) requires ^ hour to be trained (including the time 
needed to run k-means for the RBF centroid computations).
We train the independent models in parallel using a cluster, 
which makes the total training still | hour. RCL inference 
takes 8 seconds per image. 

5. Conclusion 

We presented a novel approach to depth estimation from a
single image that naturally integrates global and local
information. Global cues in the form of deep convolutional
features are used to predict the global depth map. In a
subsequent stage, the estimated global depth map is used to
guide local refinement at a higher resolution. Global
estimation is formulated as the joint learning of a depth
basis and a regression mapping from the image space to the
basis weights. The local refinement regresses directly on
pixel depth using the global estimate as a feature. Our
approach yields an improvement over the state-of-the-art on
the standard train/test split of the NYUv2 and KITTI
datasets. Furthermore, it is significantly faster and more
scalable than prior systems. Future work will involve
integrating feature learning in the framework of coupled
regression and modeling of depth.